Reduce_mul implementation #1132

Closed
emrys53 wants to merge 2 commits into xtensor-stack:master from emrys53:master

Conversation

@emrys53 (Contributor) commented Jul 1, 2025

While using XSIMD in our library, we realized that a dedicated reduce_mul function was missing. This PR introduces a reduce_mul implementation for all architectures and supported types, with architecture-specific optimizations for SSE, AVX, and AVX512 instruction sets. Relevant unit tests are included.

I was unable to test the AVX512-specific implementation on my machine, which only supports AVX2. For AVX512, I used the _mm512_reduce_mul_{type} intrinsics. According to Intel's documentation, these are sequential rather than parallel. Since I couldn't benchmark or compare alternative implementations, I left it as-is. I'm open to suggestions for improvement.
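
For illustration, a minimal sketch of an AVX512 path built on one of those intrinsics; the wrapper name is hypothetical and this is not the PR's code:

#include <immintrin.h>

// Hedged sketch: _mm512_reduce_mul_ps is one of the intrinsics mentioned
// above; Intel documents it as a sequential helper that expands to a
// series of operations rather than a single instruction.
static inline float reduce_mul_f32_avx512(__m512 v)
{
    return _mm512_reduce_mul_ps(v);
}
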
Motivation

Although XSIMD provides a general reduce function, using it with std::complex<float> or std::complex<double> fails, because as_unsigned_integer_t is not specialized for complex types. I'm not sure whether this is intended behavior, but I wanted to mention it for context. Providing reduce_mul directly avoids this issue.
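
A minimal sketch of the motivating use case, assuming the reduce_mul signature this PR proposes (batch in, scalar out); xsimd::reduce_mul here is the proposed API, not an existing library function:

#include <complex>
#include <xsimd/xsimd.hpp>

// Hedged sketch of the complex-valued use case described above.
std::complex<float> product(xsimd::batch<std::complex<float>> const& b)
{
    // The generic reduce path fails for complex batches because
    // as_unsigned_integer_t has no specialization for std::complex;
    // a dedicated reduce_mul sidesteps that machinery.
    return xsimd::reduce_mul(b);
}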

This is my first contribution to XSIMD, and I’ve only been using the library for about a week. Feedback is very welcome, and I’m happy to adjust or improve the code based on your review.


// hmul
template <class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
XSIMD_INLINE T hmul(batch<T, A> const& self, requires_arch<common>) noexcept
Contributor

Shouldn't this just be the generic implementation of reduce_mul? And if so, why not use the current generic reduction implementation?

Contributor Author

This function is only used for integral types with sizeof(T) < 4. I implemented it this way because of what happens when testing reduce_mul for uint8_t on an AVX machine: the batch has 32 elements, so the generic reduce function multiplies and swizzles the batch five times (32->16->8->4->2->1) and then calls self.get(0), which is itself too inefficient (see issue #1133). With uint8_t, my implementation instead first splits the batch into two equal sse4_2 batches, multiplies them together, and then calls hmul.
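
A hedged sketch of that split-then-reduce shape, written with raw intrinsics rather than xsimd's internal helpers, and shown for int32_t (which has a packed multiply) instead of the PR's uint8_t case:

#include <immintrin.h>
#include <cstdint>

// Illustrative only: one extract splits the 256-bit batch into halves,
// then the reduction finishes on 128 bits (4 -> 2 -> 1 lanes).
static inline int32_t reduce_mul_i32_avx2(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);      // lanes 0..3
    __m128i hi = _mm256_extracti128_si256(v, 1); // lanes 4..7
    __m128i p  = _mm_mullo_epi32(lo, hi);        // 4 partial products
    p = _mm_mullo_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(1, 0, 3, 2)));
    p = _mm_mullo_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(p);                 // product of all 8 lanes
}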

@emrys53 (Contributor, Author) Jul 1, 2025

Also, the swizzle function for uint8_t just stores the batch to a buffer and loads the result back, permuting in scalar code in between. Using that five times (on AVX) is far less efficient than the split_avx function, which splits the batch using SIMD intrinsics.
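
A hedged schematic of that buffer round-trip (illustrative, not xsimd's actual swizzle kernel):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Store the 32 uint8_t lanes to memory, permute with scalar code, and
// load the result back; this memory round-trip is the cost the comment
// is pointing at.
static inline __m256i swizzle_u8_via_buffer(__m256i v, const uint8_t (&idx)[32])
{
    alignas(32) uint8_t in[32], out[32];
    _mm256_store_si256(reinterpret_cast<__m256i*>(in), v);
    for (std::size_t i = 0; i < 32; ++i)
        out[i] = in[idx[i]];
    return _mm256_load_si256(reinterpret_cast<const __m256i*>(out));
}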

Contributor Author

@serge-sans-paille you haven't responded yet. If you are open to the idea of a reduce_mul implementation, I can fix the code so that it passes all the tests. If you have feedback or recommendations on the code I have written, post them here and I will address them after we discuss.

@DiamonDinoia (Contributor)

I'm also interested in this. @emrys53 could you rebase onto master?
@serge-sans-paille what do you think needs to change to integrate it?

serge-sans-paille added a commit that referenced this pull request Aug 20, 2025
It changes nothing for clang which has a nice shuffle optimizer, but it
does help gcc a lot.

Somehow related to #1132
serge-sans-paille added a commit that referenced this pull request Aug 23, 2025
This is a generalization of #1132 by @emrys53. Part of the Intel code is
strongly inspired by the work from #1132, with some minor nits.
@serge-sans-paille (Contributor)

@emrys53 I've submitted #1163 as a port of this patch to other targets, along with support for additional architectures. I was strongly motivated and inspired by your work, so please take it not as an offense but as a tribute!

@JohanMabille (Member)

Closing this in favor of #1163, thanks a lot @emrys53 for initiating the work on this!
